rocm: enable wmma indexer + nix flake + gfx1151 optimisation by alantsev · Pull Request #180 · antirez/ds4

alantsev · 2026-05-17T07:06:25Z

Most of the changes come from the upstream main branch. The only files directly changed by this commit are:

M Makefile
M ds4_cuda.cu
M ds4_rocm.h
M ds4_server.c

The ROCm-related changes enable the WMMA indexer for the hipcc build.

The current test and eval results:

$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
server: OK
ds4 tests: ok

The evaluation run

$ ./ds4-eval -m ds4flash.gguf --plain --questions 12 --tokens 2048 --temp 0 --seed 1
...

PASSED got 16 expected 16 (159.8s, 1437 tokens)
ds4-eval: 10/12 passed, 2 failed, runtime 00h:27m
#   state      prompt      gen    total given    correct  test
  1 PASSED        201     1661     1862 B        B        GPQA Diamond/recNu3MXkvWUzHZr9
  2 PASSED        149      370      519 C        C        SuperGPQA/001b51d76b4d422988f2c11f104a2c6c
  3 PASSED         81      623      704 70       70       AIME2025/aime2025-01
  4 FAILED        313     2048     2361 A        C        GPQA Diamond/recoiTJPGUmzAkief
  5 PASSED        272     2048     2320 J        J        SuperGPQA/b7e20eac98764fb0bf30e8366d951daa
  6 PASSED        146     1325     1471 468      468      AIME2025/aime2025-16
  7 PASSED        156     1303     1459 B        B        GPQA Diamond/rec4UqStf9WUVif1f
  8 PASSED        127      280      407 E        E        SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8
  9 FAILED        633     2048     2681 26       588      AIME2025/aime2025-02
 10 PASSED        182     1080     1262 B        B        GPQA Diamond/recgI6tUQ7RLJRWGx
 11 PASSED        137      232      369 A        A        SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09
 12 PASSED        165     1437     1602 16       16       AIME2025/aime2025-03

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
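The sequencing and terminal-event rules in the commit message above can be illustrated with a small Python sketch. This is not the server's C code: the `SseStream` class, its method names, the payload layout, and the finish-reason strings `"error"` and `"length"` are hypothetical, and the event names are abbreviated exactly as they appear in the commit text.

```python
import json


class SseStream:
    """Hypothetical helper: frames SSE events and numbers them monotonically."""

    def __init__(self):
        self.seq = 0
        self.frames = []

    def emit(self, event_type, **payload):
        # Every event carries a sequence_number, as the commit message states.
        body = {"type": event_type, "sequence_number": self.seq, **payload}
        self.seq += 1
        self.frames.append(f"event: {event_type}\ndata: {json.dumps(body)}\n\n")
        return body


def terminal_event(finish_reason):
    """Pick the terminal event by finish reason (mapping taken from the commit text)."""
    if finish_reason == "error":          # assumed label for a failed generation
        return "response.failed", {}
    if finish_reason == "length":         # assumed label for hitting the token limit
        return "response.incomplete", {"reason": "max_tokens"}
    return "response.completed", {}


stream = SseStream()
stream.emit("response.created")
stream.emit("output_text.delta", delta="Hello")
event_type, extra = terminal_event("length")
last = stream.emit(event_type, **extra)
print("".join(stream.frames), end="")
```

Keeping the counter inside the one function that frames events means no emitter can forget to attach a sequence number, which is presumably why the commit describes it as a property of every event.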

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
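A check of this kind might look roughly like the sketch below. It is only an illustration in Python, not the logic in ds4_test: the function names, the regular expression, and the fixture values are all made up for the example.

```python
import re


def parse_assignments(text):
    """Extract Name=number pairs from generated text, ignoring other prose."""
    pairs = re.findall(r"\b([A-Z][a-z]+)=(\d+)\b", text)
    return {name: int(num) for name, num in pairs}


def check_retrieval(generated, expected):
    """Return the names whose value is missing or wrong in the generated list."""
    got = parse_assignments(generated)
    return sorted(name for name, num in expected.items() if got.get(name) != num)


expected = {"Anna": 17, "Marco": 42}      # hypothetical fixture values
generated = "Anna=17\nMarco=24\n"         # hypothetical model output
wrong = check_retrieval(generated, expected)
print(wrong)
```

Comparing a parsed mapping rather than the raw text is what makes such a test deterministic: extra whitespace or surrounding prose does not affect the verdict, only the retrieved numbers do.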

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went. Error response format: { "error": { "message": "Prompt tokens (N) exceeds context size (M)", "type": "context_exceeded", "n_prompt_tokens": N, "n_ctx": M } }
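As a rough illustration of the response shape described above, here is a Python sketch that builds the error body. The function name is hypothetical and the real server is written in C; treating a prompt that exactly equals the context size as acceptable is also an assumption, since the commit message only says "exceed".

```python
import json


def context_exceeded_error(n_prompt_tokens, n_ctx):
    """Return (status, body) for an over-long prompt, or None if it fits."""
    if n_prompt_tokens <= n_ctx:      # assumption: equality still fits
        return None
    body = {
        "error": {
            "message": f"Prompt tokens ({n_prompt_tokens}) exceeds context size ({n_ctx})",
            "type": "context_exceeded",
            "n_prompt_tokens": n_prompt_tokens,
            "n_ctx": n_ctx,
        }
    }
    return 400, json.dumps(body)


status, body = context_exceeded_error(70000, 65536)
err = json.loads(body)["error"]
overshoot = err["n_prompt_tokens"] - err["n_ctx"]
print(status, err["type"], overshoot)
```

A client can compute the overshoot from the two numeric fields, as the last lines do, instead of parsing the human-readable message.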

DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual spot for CUDA inference: ATS (Address Translation Service) lets the GPU consume host-mmap'd weights directly, but at significantly lower effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS DeepSeek V4 Flash checkpoint, the difference is the model running versus the model being usable.

This commit adds:
- Startup HBM cache that copies hot tensor spans (attn projections, MoE shared experts, output projection) into device memory at engine init, capped by a configurable budget (defaults sized to leave headroom for KV cache and a second model load). Cold MoE routed experts stay ATS-mapped.
- Factored the cudaMalloc+memcpy populate path into a helper and reordered cuda_model_range_ptr so the HBM-resident lookup is a single hash-keyed read that wins over the UVA-mapped pointer on the hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as an argmax which double-paid the dispatcher cost on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path (one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes (n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- HBM cache also populated for the MTP support model.
- Drop `cudaHostRegisterReadOnly` flag: unsupported on GB10.
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so the MTP support model gets the same HBM-cache treatment.

Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):
Plain decode before: ~13.9 t/s (ATS-mapped weights, all paths)
Plain decode after: ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)

Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the 2048..65536 sweep is preserved alongside the existing m2_ultra.csv and m4_max.csv.

Generated via:
./ds4-bench -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
  --gen-tokens 128 --csv speed-bench/gb10.csv

Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf


alantsev · 2026-05-29T18:10:42Z

Rebased on top of the current main.
Added a minimal gfx1151-specific optimisation to boost generation to 13+ t/s.

$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128

ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418)
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.61,128,13.13,52184460
4096,2048,82.94,128,11.14,80373132
^C
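For reading the two CSV rows printed before the run was interrupted, a short Python sketch such as the following derives the relative generation slowdown and the KV-cache growth per context token. It uses only the values shown in the output above; extrapolating beyond these two data points would be an assumption.

```python
import csv
import io

# The two rows printed before the run was interrupted with ^C.
BENCH_CSV = """ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.61,128,13.13,52184460
4096,2048,82.94,128,11.14,80373132
"""

rows = list(csv.DictReader(io.StringIO(BENCH_CSV)))
first, second = rows

# Relative generation slowdown between the two context sizes.
gen_drop = 1.0 - float(second["gen_tps"]) / float(first["gen_tps"])

# KV-cache growth per additional context token over this interval.
kv_per_token = (int(second["kvcache_bytes"]) - int(first["kvcache_bytes"])) / (
    int(second["ctx_tokens"]) - int(first["ctx_tokens"])
)

print(f"generation drop: {gen_drop:.1%}")
print(f"KV cache growth: {kv_per_token:.0f} bytes/token")
```

On these two rows, generation throughput falls by roughly 15% when the context doubles from 2048 to 4096 tokens, and the KV cache grows by 13764 bytes per token.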


$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
long-context: OK
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: OK
local-golden-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=5/5 top20_overlap=17/20 top64_overlap=55/64 top20_max_abs=2.02672
local-golden-vectors: OK
metal-short-prefill:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
metal-short-prefill: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
metal-tensor-equivalence:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence candidate route=auto
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-16.7933 cand=-16.7933 abs=0 id=1 ref=20.1809 cand=20.1809 abs=0 id=2 ref=-57.0803 cand=-57.0803 abs=0 id=3 ref=17.8732 cand=17.8732 abs=0 id=4 ref=27.5367 cand=27.5367 abs=0
ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-4.79073 cand=-4.79073 abs=0 id=1 ref=21.6964 cand=21.6964 abs=0 id=2 ref=-47.264 cand=-47.264 abs=0 id=3 ref=10.8016 cand=10.8016 abs=0 id=4 ref=25.4716 cand=25.4716 abs=0
ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0
ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-2.59292 cand=-2.59292 abs=0 id=1 ref=22.9133 cand=22.9133 abs=0 id=2 ref=-43.2019 cand=-43.2019 abs=0 id=3 ref=15.7734 cand=15.7734 abs=0 id=4 ref=18.2225 cand=18.2225 abs=0
ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=3/5 overlap=18/20 max_rank_delta=5 rms=0.996669 max_abs=3.80748 top20_max_abs=2.21095
ds4-test: Tensor equivalence long_memory_archive largest deltas: id=103758 ref=-8.21769 cand=-4.4102 abs=3.80748 id=1335 ref=9.72559 cand=5.94716 abs=3.77842 id=25160 ref=-8.34078 cand=-4.60154 abs=3.73924 id=24300 ref=-12.4001 cand=-8.69963 abs=3.70044 id=3413 ref=14.2124 cand=10.5636 abs=3.64885
ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=18/20 max_rank_delta=5 rms=0.466425 max_abs=2.19618 top20_max_abs=1.04793
ds4-test: Tensor equivalence long_code_audit largest deltas: id=84028 ref=-12.8415 cand=-15.0377 abs=2.19618 id=104937 ref=0.399135 cand=-1.74162 abs=2.14075 id=28179 ref=4.85577 cand=2.71859 abs=2.13718 id=79754 ref=4.41424 cand=2.33946 abs=2.07478 id=124695 ref=8.06731 cand=10.1345 abs=2.06717
ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=3/5 min_overlap=18/20 worst_rank_delta=5 worst_rms=0.996669 worst_max_abs=3.80748 worst_top20_max_abs=2.21095
metal-tensor-equivalence: OK
server:
server: OK
ds4 tests: ok
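The fields in the tensor-equivalence lines (top1, top5_overlap, rms, max_abs) read as comparisons between a reference and a candidate logit vector. The Python sketch below shows one plausible way to compute such metrics; the exact definitions used by ds4_test are not shown in this thread, so the formulas, function names, and sample values here are assumptions.

```python
import math


def top_k(logits, k):
    """Indices of the k largest logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def compare_logits(ref, cand, k=5):
    """Summarise drift between a reference and a candidate logit vector."""
    diffs = [abs(r - c) for r, c in zip(ref, cand)]
    return {
        "top1_match": top_k(ref, 1) == top_k(cand, 1),
        "topk_overlap": len(set(top_k(ref, k)) & set(top_k(cand, k))),
        "rms": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "max_abs": max(diffs),
    }


ref = [20.0, -16.5, 17.9, 27.5, -57.0, 3.0]     # made-up values, not from the log
cand = [20.5, -16.5, 17.4, 27.5, -57.0, 3.0]
result = compare_logits(ref, cand, k=3)
print(result)
```

Under these definitions, a line with rms=0 and max_abs=0 means the two vectors are identical, while the long_memory_archive line (top5_overlap=3/5, max_abs=3.80748) indicates measurable drift that still leaves the top-1 token unchanged.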


alantsev · 2026-05-29T19:21:40Z

Found a bug in the submitted kernel; fixing it breaks the logprob test:

$ ./ds4_test --logprob-vectors
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_code_completion step 1 selected token mismatch
tests/ds4_test.c:808: assertion failed: false
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive skipped (API/official graph mismatch)
ds4-test: vector long_code_audit
logprob-vectors: ERR
ds4 tests: 1 failure(s)

I will close this PR and submit another one after fixing the issue.

mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

fix typo in readme

741d0cc

dwarfstar is typoed to drawfstar

Merge pull request antirez#155 from kernelzeroday/main

98593ec

fix typo in readme

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

antirez added 3 commits May 27, 2026 16:11

Harden wide MoE tile dispatch

9ca9013

Revert "Harden wide MoE tile dispatch"

f183c19

This reverts commit 9ca9013.

Revert "Merge PR antirez#264: Add wide-token MoE prefill tiles"

072bc0f

This reverts commit 805368e, reversing changes made to e8e8779.

alantsev force-pushed the rocm branch from 4a72f50 to 7382cf1 Compare May 28, 2026 08:13

antirez and others added 3 commits May 28, 2026 17:26

Add local golden inference drift test

17502b9

Guard MoE Metal tile shape

22393e7

merge from main@upstream

83f8043

alantsev force-pushed the rocm branch from 7382cf1 to 55988ba Compare May 28, 2026 16:06

antirez and others added 13 commits May 29, 2026 11:13

Avoid duplicate CLI prefill completion lines

34cd76a

Add distributed inference

abdc807

Add coordinator/worker distributed layer execution, pipelined prefill, worker routing, telemetry, activation transport width, and KV mismatch recovery for DeepSeek Flash/Pro.

cuda: gate Spark HBM cache to cuda-spark builds

bf3ff6d

Add distributed KV checkpoint support

4844d55

Merge branch 'distributed'

ae5fbf1

# Conflicts: # Makefile

cuda: implement model map span API

caa60f2

(cherry picked from commit e00ad3085c8edbd6c98a50ba4ad49a66c2b23984)

build: fix current compiler warnings

fec25da

(cherry picked from commit 0b3efaf86f61421330e90629508adbd6228b4a8b)

merge from main@upstream

69b77c7

enable agent for rocm

a87137a

simplify Makefile

c5b9e72

ds4-agent experiment - add multi-platform nix flake

cdf6fe7

alantsev force-pushed the rocm branch from 55988ba to a8992bf Compare May 29, 2026 18:08

alantsev changed the title ~~rocm: enable wmma indexer support~~ rocm: enable wmma indexer support + nix flake + gfx1151 specific optimisation May 29, 2026

alantsev changed the title ~~rocm: enable wmma indexer support + nix flake + gfx1151 specific optimisation~~ rocm: enable wmma indexer + nix flake + gfx1151 optimisation May 29, 2026

alantsev force-pushed the rocm branch from a8992bf to 42a15ee Compare May 29, 2026 19:12

alantsev closed this May 29, 2026

alantsev mentioned this pull request May 30, 2026

rocm - rebased on top of the current main branch, nix build, changes to the rocm version of the kernel #290

Closed

alantsev wants to merge 185 commits into antirez:rocm from alantsev:rocm
